{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcc3 Solution for Exercise 01\n",
    "\n",
    "The aim of this exercise is to highlight caveats to have in mind when using\n",
    "feature selection. You have to be extremely careful regarding the set of data\n",
    "on which you will compute the statistic that helps your feature selection\n",
    "algorithm to decide which feature to select.\n",
    "\n",
    "On purpose, we will make you program the wrong way of doing feature selection\n",
    "to gain insights.\n",
    "\n",
    "First, you will create a completely random dataset using NumPy. Using the\n",
    "function `np.random.randn`, generate a matrix `data` containing 100 samples\n",
    "and 100,000 features. Then, using the function `np.random.randint`, generate a\n",
    "vector `target` with 100 samples containing either 0 or 1.\n",
    "\n",
    "This type of dimensionality is typical in bioinformatics when dealing with\n",
    "RNA-seq. However, we will use completely randomized features such that we\n",
    "don't have a link between the data and the target. Thus, the generalization\n",
    "performance of any machine-learning model should not perform better than the\n",
    "chance-level."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# solution\n",
    "rng = np.random.RandomState(42)\n",
    "data, target = rng.randn(100, 100000), rng.randint(0, 2, size=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, create a logistic regression model and use cross-validation to check the\n",
    "score of such a model. It will allow use to confirm that our model cannot\n",
    "predict anything meaningful from random data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# solution\n",
    "model = LogisticRegression()\n",
    "test_score = cross_val_score(model, data, target, n_jobs=2)\n",
    "print(f\"The mean accuracy is: {test_score.mean():.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "It is not surprising that the logistic regression model performs as bad as\n",
    "pure chance when we provide the full dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we will ask you to program the **wrong** pattern to select feature.\n",
    "Select the feature by using the entire dataset. We will choose ten features\n",
    "with the highest ANOVA F-score computed on the full dataset. Subsequently,\n",
    "subsample the dataset `data` by selecting the features' subset. Finally, train\n",
    "and test a logistic regression model.\n",
    "\n",
    "You should get some surprising results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_selection import SelectKBest, f_classif\n",
    "\n",
    "# solution\n",
    "feature_selector = SelectKBest(score_func=f_classif, k=10)\n",
    "data_subset = feature_selector.fit_transform(data, target)\n",
    "test_score = cross_val_score(model, data_subset, target)\n",
    "print(f\"The mean accuracy is: {test_score.mean():.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "Surprisingly, the logistic regression succeeded in having a fantastic accuracy\n",
    "using data that did not have any link with the target in the first place. We\n",
    "therefore know that these results are not legit.\n",
    "\n",
    "The reasons for obtaining these results are two folds: the pool of available\n",
    "features is large compared to the number of samples. It is possible to find a\n",
    "subset of features that will link the data and the target. By not splitting\n",
    "the data, we leak knowledge from the entire dataset and could use this\n",
    "knowledge while evaluating our model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we will make you program the **right** way to do the feature selection.\n",
    "First, split the dataset into a training and testing set. Then, fit the\n",
    "feature selector on the training set. Then, transform both the training and\n",
    "testing sets before you train and test the logistic regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# solution\n",
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data, target, random_state=0\n",
    ")\n",
    "feature_selector.fit(data_train, target_train)\n",
    "data_train_subset = feature_selector.transform(data_train)\n",
    "data_test_subset = feature_selector.transform(data_test)\n",
    "model.fit(data_train_subset, target_train)\n",
    "test_score = model.score(data_test_subset, target_test)\n",
    "print(f\"The mean accuracy is: {test_score:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "It is not a surprise that our model is not working. We see that selecting\n",
    "features only on the training set will not help when testing our model. In\n",
    "this case, we obtained the expected results.\n",
    "\n",
    "Therefore, as with hyperparameters optimization or model selection, tuning the\n",
    "feature space should be done solely on the training set, keeping a part of the\n",
    "data left-out.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, the previous case is not perfect. For instance, if we were asking to\n",
    "perform cross-validation, the manual `fit`/`transform` of the datasets will\n",
    "make our life hard. Indeed, the solution here is to use a scikit-learn\n",
    "pipeline in which the feature selection will be a pre processing stage before\n",
    "to train the model.\n",
    "\n",
    "Thus, start by creating a pipeline with the feature selector and the logistic\n",
    "regression. Then, use cross-validation to get an estimate of the uncertainty\n",
    "of your model generalization performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "# solution\n",
    "model = make_pipeline(feature_selector, LogisticRegression())\n",
    "test_score = cross_val_score(model, data, target)\n",
    "print(f\"The mean accuracy is: {test_score.mean():.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "We see that using a scikit-learn pipeline removes a lot of boilerplate code\n",
    "and helps avoid mistakes when calling `fit` and `transform` on the different\n",
    "set of data."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}